
feat: page-bounded Arrow decoder per data page (PR-6a.2) #6407

Draft

g-talbot wants to merge 4 commits into gtt/column-page-stream-trait from gtt/parquet-page-decoder

Conversation

@g-talbot (Contributor) commented May 8, 2026

Summary

  • Rebuilds the page-stream → Arrow decoder to be page-bounded: StreamDecoder::decode_next_page() returns one [DecodedPage] per call (rg_idx, col_idx, page_idx, row_start, ArrayRef) instead of materialising an entire row group at a time; see the sketch after this list.
  • Memory: ~one in-flight page (compressed + decompressed bytes) + one cached dictionary page per (rg, col) when dict-encoded. The decoder does NOT buffer a row group, column chunk, or any materialised array beyond the one currently being emitted.
  • PR-6b's merge engine consumes [DecodedPage]s in storage order (row-group-major, column-major-within-rg, page-major-within-col), applies merge plan slicing per page, and streams output pages directly into the writer without column-chunk staging.
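
A minimal sketch of the per-page unit described above; field names follow the summary bullet, while the exact types are assumptions, not the PR's literal definitions:

```rust
// Sketch only: types assumed, names taken from the summary bullet.
use arrow_array::ArrayRef;

pub struct DecodedPage {
    pub rg_idx: usize,    // row group index in the input file
    pub col_idx: usize,   // leaf column index within the row group
    pub page_idx: usize,  // data page index within the column chunk
    pub row_start: usize, // first row this page covers within (rg, col)
    pub array: ArrayRef,  // decoded values for exactly this page
}
```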

How it works

  1. Pull one [Page] from the underlying [ColumnPageStream]. Skip INDEX_PAGE (historical Thrift variant, not emitted by production writers).
  2. Look up or initialise per-(rg, col) state: a PageQueue that feeds parquet-rs's [ColumnReader] one page at a time, plus a counter tracking rows decoded so far.
  3. Convert the [Page] to parquet-rs's column::page::Page enum: decompress via [parquet::compression::create_codec] (requires parquet's experimental feature), translate format::Encoding (Thrift wrapper) → basic::Encoding (Rust enum) via a manual i32 match since parquet-rs exposes no public conversion (sketched after this list), and drop optional statistics.
  4. Push the converted page onto the queue. Dictionary/index pages are absorbed silently for use by subsequent data pages.
  5. For a data page: ask the [ColumnReader] to decode exactly header.num_values records via read_records(...) calls in a loop, pulling values + def/rep levels into typed buffers.
  6. Build an ArrayRef from (values, def_levels, rep_levels) per the column's parquet physical type. Emit [DecodedPage].
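
The step-3 encoding translation, as a sketch: the i32 values are the Parquet Thrift Encoding enum, the target variants are parquet::basic::Encoding, and the function name is ours.

```rust
use parquet::basic::Encoding;
use parquet::errors::{ParquetError, Result};

// Manual i32 match from the Thrift encoding value to parquet-rs's
// basic::Encoding, since no public conversion exists between the two.
fn translate_encoding(thrift_value: i32) -> Result<Encoding> {
    Ok(match thrift_value {
        0 => Encoding::PLAIN,
        2 => Encoding::PLAIN_DICTIONARY,
        3 => Encoding::RLE,
        4 => Encoding::BIT_PACKED,
        5 => Encoding::DELTA_BINARY_PACKED,
        6 => Encoding::DELTA_LENGTH_BYTE_ARRAY,
        7 => Encoding::DELTA_BYTE_ARRAY,
        8 => Encoding::RLE_DICTIONARY,
        9 => Encoding::BYTE_STREAM_SPLIT,
        other => return Err(ParquetError::General(format!(
            "unsupported encoding: {other}"
        ))),
    })
}
```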

Type coverage

Flat physical types: Boolean, Int8/16/32 + UInt8/16/32 (parquet Int32 with logical annotation), Int64/UInt64 (parquet Int64), Float32, Float64, Utf8/LargeUtf8/Binary/LargeBinary (parquet ByteArray). Dictionary-encoded pages are decoded via the cached dict page → values pipeline.
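
For flat nullable columns (max_def = 1), read_records fills the values buffer with densely packed non-null values while def levels mark presence. A sketch of the array build, with a helper name of our own:

```rust
use std::sync::Arc;
use arrow_array::{ArrayRef, Int32Array};

// Sketch: re-expand densely packed non-null values against def levels
// (def == 1 means a value is present) into a nullable Int32 array.
fn int32_array_from_levels(values: Vec<i32>, def_levels: &[i16]) -> ArrayRef {
    let mut vals = values.into_iter();
    let opts: Vec<Option<i32>> = def_levels
        .iter()
        .map(|&d| if d == 1 { vals.next() } else { None })
        .collect();
    Arc::new(Int32Array::from(opts))
}
```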

List<T> / LargeList<T> where outer + inner are non-nullable and inner is a flat primitive — covers DDSketch keys (List<Int16>) and counts (List<UInt64>). Dremel def/rep levels (max_def=1, max_rep=1) are decoded via the same read_records path; arrow offsets are computed via list_offsets_from_levels.
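
What list_offsets_from_levels computes for this shape, as a sketch (not the PR's literal code): rep == 0 opens a new list, def == 1 adds one element to it, def == 0 leaves it empty.

```rust
// Derive arrow list offsets from Dremel levels for non-nullable
// List<primitive> with max_def = 1, max_rep = 1.
fn list_offsets_from_levels(def_levels: &[i16], rep_levels: &[i16]) -> Vec<i32> {
    let mut offsets = vec![0i32];
    for (&d, &r) in def_levels.iter().zip(rep_levels) {
        if r == 0 {
            // A new top-level list starts here.
            offsets.push(*offsets.last().unwrap());
        }
        if d == 1 {
            // One decoded value belongs to the current list.
            *offsets.last_mut().unwrap() += 1;
        }
        // d == 0 with r == 0 is an empty list: the new offset stays put.
    }
    offsets
}
```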

Other nested shapes (nullable list inner/outer, Struct, Map, FixedSizeList, multi-leaf nested) return an unsupported-type error rather than silently falling back to a different mechanism.

Sync ⇄ async bridging

The page stream is async (S3 reads); PageReader (from parquet-rs) is sync. Bridged via Arc<Mutex<VecDeque<ColumnPage>>> per (rg, col): decode_next_page pulls from the stream (async), pushes onto the queue, then the sync PageReader impl pops from the queue when the ColumnReader asks for the next page. peek_next_page and skip_next_page are properly implemented to support the parquet-rs reader's state machine.
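
A sketch of the bridge, assuming a recent parquet-rs where PageMetadata carries num_rows / num_levels / is_dict (the queue type name is ours):

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use parquet::column::page::{Page, PageMetadata, PageReader};
use parquet::errors::Result;

// The async side pushes converted pages; the sync PageReader side pops
// them when the ColumnReader asks for the next page.
#[derive(Clone, Default)]
struct PageQueue(Arc<Mutex<VecDeque<Page>>>);

impl Iterator for PageQueue {
    type Item = Result<Page>;
    fn next(&mut self) -> Option<Self::Item> {
        self.0.lock().unwrap().pop_front().map(Ok)
    }
}

impl PageReader for PageQueue {
    fn get_next_page(&mut self) -> Result<Option<Page>> {
        Ok(self.0.lock().unwrap().pop_front())
    }
    // peek answers without consuming the queued page, which the reader's
    // state machine needs before deciding how to decode.
    fn peek_next_page(&mut self) -> Result<Option<PageMetadata>> {
        Ok(self.0.lock().unwrap().front().map(|p| {
            let is_dict = matches!(p, Page::DictionaryPage { .. });
            PageMetadata {
                num_rows: None,
                num_levels: (!is_dict).then(|| p.num_values() as usize),
                is_dict,
            }
        }))
    }
    fn skip_next_page(&mut self) -> Result<()> {
        self.0.lock().unwrap().pop_front();
        Ok(())
    }
}
```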

Schema handling

parquet_to_arrow_schema(parquet_schema, None) bypasses the ARROW:schema hint that would otherwise force Dictionary types — input parquet files written from arrow declare Dictionary columns in ARROW:schema metadata, but their page-encoded values are plain values that decode to StringArray/etc. Decoding without the hint gives consistent flat-primitive output that the merge engine then interleaves.
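
The call itself, as a sketch with a hypothetical wrapper name:

```rust
use arrow_schema::Schema;
use parquet::arrow::parquet_to_arrow_schema;
use parquet::errors::Result;
use parquet::file::metadata::ParquetMetaData;

// Passing None for key_value_metadata ignores the embedded ARROW:schema
// hint, so Dictionary-declared columns decode as their flat value types.
fn decode_schema(meta: &ParquetMetaData) -> Result<Schema> {
    parquet_to_arrow_schema(meta.file_metadata().schema_descr(), None)
}
```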

Tests

9 tests, all passing:

  • test_drain_single_rg_round_trip, test_drain_multi_rg_round_trip — full round-trip via decode_next_page matches ParquetRecordBatchReaderBuilder.
  • test_decoded_page_row_indexing — row_start correctly tracks per-(rg, col) row offsets.
  • test_eof_idempotent — repeated calls after EOF stay Ok(None).
  • test_nullable_column_round_trip — def-level decoding for nullable cols.
  • test_compression_codecs — snap, gzip, zstd round-trip.
  • test_page_bounded_queue_depth — verifies the internal page queue depth stays ≤ 2 across a long stream (the page-bounded contract).
  • test_list_uint64_round_trip — List<UInt64> (DDSketch shape) round-trip.
  • test_io_failure_surfaces_as_page_stream_error — body GET failures propagate as PageStream(Io), not masked as decode errors.

Stack

Base: gtt/column-page-stream-trait (PR-5a #6406).

PR-6b (#6409) builds the streaming merge engine on top of this decoder.

g-talbot and others added 2 commits May 11, 2026 07:15
Bridges PR-4's ColumnPageStream (raw compressed pages in storage order)
to arrow's standard ParquetRecordBatchReaderBuilder (decoded arrays).
PR-6's streaming merge engine drains each input row-group through this
to keep per-RG memory bounded — only one input RG worth of bytes is
materialised at a time, rather than the whole file.

Approach: reconstruct one row group's column-chunk byte layout in a
buffer (column chunks placed at their original offsets, gaps zero-
padded), wrap the buffer in `Bytes`, and feed it to
`ParquetRecordBatchReaderBuilder::new_with_metadata` with
`with_row_groups([rg_idx])`. Reconstruction is byte-exact: each page's
original Thrift-compact `header_bytes` is carried through PR-4's
streaming reader — no re-encoding, so encoder-version drift inside
the compactor cannot silently corrupt outputs.

Adds `header_bytes: Bytes` to `Page` and captures the drained
header bytes inside `parse_page_header_streaming`. New
`StreamDecoder` borrows the stream and exposes `next_rg()` returning
one `RecordBatch` per input row group, idempotent at EOF.

Tests (9, all passing): single-RG and multi-RG drains, multi-page
columns, dict columns, null preservation, compression codec roundtrip
(uncompressed/snappy/zstd — LZ4 not enabled in our parquet feature
set), idempotent EOF, byte-exact reconstruction proof, and I/O failure
surfacing as PageDecodeError::PageStream rather than masked as decode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI nightly rustfmt (newer than my local at the time of the original
push) wraps `write_parquet(...)` onto multiple lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
g-talbot and others added 2 commits May 11, 2026 09:47
Replaces PR-6a's per-RG fat-buffer approach. The previous implementation
reconstructed a whole row group's column-chunk bytes into a single
buffer and fed it to ParquetRecordBatchReaderBuilder — peak memory was
RG-size (tens to hundreds of MB per call). This rewrite is
page-bounded.

API change: `StreamDecoder::next_rg() -> Option<RecordBatch>` is
replaced by `decode_next_page() -> Option<DecodedPage>`. Each call
returns one input data page's worth of decoded rows as an
`ArrayRef`, plus `(rg_idx, col_idx, page_idx_in_col, row_start)`
indexing so PR-6b's merge engine can slice take indices per page.
Dictionary pages are absorbed silently (fed to the column reader for
subsequent data-page decoding); INDEX_PAGE is skipped.

Memory at any time:
- One in-flight page (compressed + decompressed bytes)
- One cached dictionary page per (rg, col) when dict-encoded
- One column reader per (rg, col) with small bookkeeping (level
  decoders, value decoder)

Does NOT buffer the row group, a column chunk, or a materialised
RecordBatch.

Implementation: wraps parquet-rs's public `GenericColumnReader` over
a per-(rg, col) PageQueue we feed one page at a time. Page → ColumnPage
conversion handles decompression (via `compression::create_codec`,
which required enabling parquet's `experimental` feature in our
Cargo.toml — the API has been stable across recent parquet-rs versions,
just not yet de-experimentalised), `format::Encoding` (Thrift wrapper)
→ `basic::Encoding` translation, and DataPageV2's
unencrypted-levels-then-compressed-values layout.

Array builders cover the production schema: Boolean, Int8/16/32/64,
UInt8/16/32/64, Float32/64, Utf8/LargeUtf8/Binary/LargeBinary, and
`List<non-nullable primitive>` (DDSketch `keys` / `counts`). Dict
columns decode to their value type (Utf8/Binary); the merge engine's
union schema normalises strings to Utf8 anyway, and the output writer
re-applies dict encoding based on observed cardinality.

Tests (9, all passing):
- single-RG and multi-RG round-trip (per-column comparison vs. canonical
  arrow reader)
- per-page indexing (`row_start`, `page_idx_in_col` monotonic
  per-(rg, col))
- idempotent EOF
- nullable column (`service` with nulls every 5th row)
- compression codecs (uncompressed, snappy, zstd)
- I/O failures surface as `PageDecodeError::PageStream`
- `List<UInt64>` (DDSketch `counts`) with variable list lengths
  including empty list and `u64::MAX`
- structural page-bounded contract: PageQueue depth ≤ 2 (one queued
  dictionary plus the current data page) across a long stream

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI's `cargo +nightly fmt --check` flags a single trailing blank
line at end of file. No functional change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@g-talbot g-talbot changed the title feat: page-stream → RecordBatch decoder (PR-6a) feat: page-bounded Arrow decoder per data page (PR-6a.2) May 11, 2026